Climate change is a very big problem in contemporary times, and India, as the second most populous country in the world, contributes significantly to air pollution with many cities in the top 50 most polluted cities in the world. But the extent of pollution is not properly studied and used in decision making in India. Air quality monitoring stations are present in many cities but the data isn’t used for predictions in most cases, but rather as an observation. Another problem is that these AQI monitoring stations aren’t present in small cities. The project aims to build a forecasting model to predict AQI. The project focuses on the city of Indian cities like Delhi, Mumbai and Chennai which are infamous for their pollution levels.
The plots on the side reveal how developing countries (Low Income Mediterranean, South East Asia) contribute most to the particulate pollution. Moreover the other plots reveal how India compares to the rest of the world
LMIC: Low Middle Income Countries
HIC: High Income Countries
The plots on the side show the amounts of pollutants and their spread across various cities in India. Particulate Matter is not shown here. The pollutants shown here make up significant portion of the Air Quality Index which is a widely used measure. We will be using the AQI as well in this project
By looking at the autocorrelation function (ACF) and partial autocorrelation (PACF) plots we can tentatively identify the numbers of AR and/or MA terms that are needed.ACF plot is merely a bar chart of the coefficients of correlation between a time series and lags of itself. The PACF plot is a plot of the partial correlation coefficients between the series and lags of itself.
We have used the algorithm proposed by Hyndman & Khandakarin 2008 which is implemented through auto.arima in R. This simplifies the process but increases time complexity. Since we have a relatively small data set we have gone ahead with this.
SARIMA is a Integrated AR, MA model fitted for daily/monthly values as well as an additional seasonal term. We have fitted the model using auto.arima.
The SARIMAX model has some regressors as well, to model the variable change with time. This model was fitted using auto.arima but the results were unfavorable due to missing values and compounding of errors.A Distributed Lag model was also checked but gave worse results.
The Ljung Box test of the residuals show if the residuals are white noise or if they have predictability. The squared residual plot checks conditional heteroscedasticity.
There are various error measures that can be used for model selection - AIC, BIC, MSE, MAE, MAPE, etc. We have used MAPE. We have chosen the most robust model as that is also an important criteria in selection.
The Chennai model has a lot of missing values and hence cannot be predicted properly, hence the apparent lack of seasonality. We drop Chennai from our analysis.
[1] "Delhi"
Series: AQID.ts
ARIMA(1,1,2)(0,1,0)[365]
Coefficients:
ar1 ma1 ma2
0.5423 -0.6921 -0.2647
s.e. 0.0347 0.0383 0.0334
sigma^2 estimated as 4716: log likelihood=-9282.91
AIC=18573.83 AICc=18573.85 BIC=18595.45
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.7725529 62.04998 41.97265 -4.096785 20.52469 0.5027082
ACF1
Training set -0.003184295
[1] "Mumbai"
Series: AQIM.ts
ARIMA(1,0,0)(0,1,0)[365] with drift
Coefficients:
ar1 drift
0.7150 -0.0262
s.e. 0.0339 0.0121
sigma^2 estimated as 675.2: log likelihood=-1972.82
AIC=3951.64 AICc=3951.7 BIC=3963.78
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.01404403 18.98299 9.353563 -1.96953 9.687569 0.3399684
ACF1
Training set 0.02590205
[1] "Chennai"
Series: AQIC.ts
ARIMA(4,1,3)
Coefficients:
ar1 ar2 ar3 ar4 ma1 ma2 ma3
1.5983 -0.8616 0.1788 -0.0069 -1.9292 1.162 -0.2259
s.e. 0.2212 NaN NaN 0.0138 0.2212 NaN NaN
sigma^2 estimated as 1228: log likelihood=-9579.86
AIC=19175.72 AICc=19175.8 BIC=19220.23
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -1.28413 34.9666 22.85284 -7.48814 20.17663 0.4631568 -0.001062613
These ARIMAX models are done using all the meteorological variables. This gives models with larger MAPE than the ARIMA model without an X variable/ matrix. This can be due to high deviations in the meteorological variables. A distributed lagged model was also fitted to check. ARIMA with no X variable is the best in any case.
[1] "Delhi"
Series: ydl
Regression with ARIMA(1,0,0) errors
Coefficients:
ar1 temp pressure humidity wind
0.8746 1.8201 0.0421 0.2311 0.3091
s.e. 0.0191 0.5682 0.0214 0.1363 0.3253
sigma^2 estimated as 426.9: log likelihood=-3498.19
AIC=7008.37 AICc=7008.48 BIC=7036.38
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.0194588 20.59614 13.4242 -3.261841 12.46601 1.011254 0.03280456
[1] "Mumbai"
Series: ydl
Regression with ARIMA(2,1,2) errors
Coefficients:
ar1 ar2 ma1 ma2 temp pressure humidity wind
1.0552 -0.4096 -1.2309 0.3260 -1.2183 0.3911 -0.2039 0.0950
s.e. 0.1460 0.0964 0.1531 0.1417 1.0391 0.7103 0.1454 0.4229
sigma^2 estimated as 390.3: log likelihood=-3456.66
AIC=6931.33 AICc=6931.56 BIC=6973.33
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.07747973 19.64383 12.64135 -2.087419 11.39747 0.9522809
ACF1
Training set -0.001610694
[1] "Chennai"
Series: ydl
Regression with ARIMA(1,1,1) errors
Coefficients:
ar1 ma1 temp pressure humidity wind
0.6278 -0.9694 -1.2233 -0.2736 0.1754 -0.0954
s.e. 0.0248 0.0098 1.3770 0.8662 0.2518 0.5131
sigma^2 estimated as 1406: log likelihood=-7557.43
AIC=15128.87 AICc=15128.94 BIC=15166.05
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -1.553327 37.40857 25.00667 -8.123409 21.2297 0.9668808
ACF1
Training set 0.0009152456
Box-Ljung test
data: ARIMAXD.out$residuals
X-squared = 6.5218, df = 10, p-value = 0.7697
Box-Ljung test
data: ARIMAXD.out$residuals^2
X-squared = 49.033, df = 1, p-value = 2.517e-12
Box-Ljung test
data: ARIMAXM.out$residuals
X-squared = 15.527, df = 10, p-value = 0.114
Box-Ljung test
data: ARIMAXM.out$residuals^2
X-squared = 125.9, df = 1, p-value < 2.2e-16
Box-Ljung test
data: ARIMAXC.out$residuals
X-squared = 14.726, df = 10, p-value = 0.1424
Box-Ljung test
data: ARIMAXC.out$residuals^2
X-squared = 142.62, df = 1, p-value < 2.2e-16
In this project we have identified the SARIMA function using Hyndman & Khandakar algorithm in auto.arima to be the best model for our problem. Due to the limitations of computing power and relatively low quality of data, we were only able to forecast for Delhi and Mumbai. With distributed computing, this can be overcome. The model can be automated and scaled with better quality data to make an interactive dashboard for policy makers to make decisions and to track climate change and fluctuations. Technologies like HDFS, Map Reduce with R can be on a cloud service provider used for scaling.
The Ljung Box test confirmed conditional heteroscedasticity. GARCH models were fitted but the standard deviations were too high. The very nature of this data is highly volatile. Therefore we have obtained a reasonable prediction. Usage of a fourier curve for the seasonal component or modern techniques like LSTM Neural Nets might give better results.
# This code was used to scrape meterological data from the web
data<-NULL
for ( year in 2019:2020)
{
for(month in 1:12)
{
monthname<-c('january','february','march','april','may','june','july','august','september','october','november','december')
if (monthname[month] %in% c('january','march','may','july','august','october','december')){j=31}
else {j=30}
if(month==2&year%%4==0){j=29}
else if (month==2&year%%4!=0){j=28}
for(day in 1:j)
{
url<-read_html(paste0('https://en.tutiempo.net/records/vidp/',day,'-',monthname[month],'-',year,'.html'))
temp.raw<-url%>%
html_nodes('.Temp')%>%
html_text()
temp.raw<-gsub('[^\x20-\x7E]', '', temp.raw)
humid.raw<-url%>%
html_nodes('.hr')%>%
html_text()
humid.raw<-str_remove(humid.raw,'%')
wind.raw<-url%>%
html_nodes('.wind')%>%
html_text()wind.raw<-str_remove(wind.raw,' km/h')
wind.raw<-na.omit(str_extract(wind.raw, '[[:digit:]]+'))
pressure.raw<-url%>%
html_nodes('.prob')%>%
html_text()
pressure.raw<-str_remove(pressure.raw,'hPa')
humidity<-mean(na.omit(as.integer(humid.raw)))
temp<-mean(na.omit(as.integer(temp.raw)))
pressure<-mean(na.omit(as.integer(pressure.raw)))
wind<-mean(na.omit(as.integer(wind.raw)))
Date<-as.Date(paste0(year,'-',month,'-',day))
datatemp<-data.frame(Date,temp,pressure,humidity,wind)
data<-rbind(data,datatemp)
}
}
nametemp<-paste('data',year,sep='_')
assign(nametemp,data)
}
data2<-rbind(data_2015,data_2016,data_2017,data_2018,data_2019,data_2020)
write.csv(data2,'data2.csv', row.names = FALSE)---
title: 'Air Quality Prediction'
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
theme: flatly
social: [ "twitter", "facebook", "menu" ]
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(htmltab)
library(tsbox)
library(stR)
library(quantmod)
library(tseries)
library(forecast)
library(dygraphs)
library(leaflet)
library(ggplot2)
library(plotly)
library(readxl)
data <- read.csv('~/R/AQI prediction/data2.csv')
data$Date<-as.Date(data$Date)
AQI <- read.csv('~/R/AQI prediction/Data/city_day.csv')
AQID<-dplyr::filter(AQI,City=="Delhi")
AQIM<-dplyr::filter(AQI,City=="Mumbai")
AQIC<-dplyr::filter(AQI,City=="Chennai")
AQID<-AQID[,c("Date","AQI")]
AQID$Date<-as.Date(AQID$Date,format="%d-%m-%Y")
AQIM<-AQIM[,c("Date","AQI")]
AQIM$Date<-as.Date(AQIM$Date,format="%d-%m-%Y")
AQIC<-AQIC[,c("Date","AQI")]
AQIC$Date<-as.Date(AQIC$Date,format="%d-%m-%Y")
AQID.ts<-ts(AQID$AQI,start=c(2015,1),frequency=365)
AQID.ts<-na.approx(AQID.ts)
AQIM.ts<-ts(AQIM$AQI,start=c(2015,1),frequency=365)
AQIM.ts<-na.approx(AQIM.ts)
AQIC.ts<-ts(AQIC$AQI,start=c(2015,1),frequency=365)
AQIC.ts<-na.approx(AQIC.ts)
```
```{r , include=FALSE}
worldPM1025 <- read_excel("aap_air_quality_database_2018_v14.xlsx", sheet = "database",
col_types = c("text","skip", "text", "skip", "skip", "text", "skip", "skip",
"text", "skip", "skip", "skip", "skip", "skip", "skip"), skip = 2)
#Source: WHO Global Ambient Air Quality Database - update 2018
#Preprocessing
names(worldPM1025)<-c("Region", "Country","PM10","PM2.5")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"-converted value","")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"[)]","")
worldPM1025$PM10<-stringr::str_replace(worldPM1025$PM10,"[(]","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"-converted value","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"[)]","")
worldPM1025$PM2.5<-stringr::str_replace(worldPM1025$PM2.5,"[(]","")
worldPM1025$PM2.5<-as.numeric(worldPM1025$PM2.5)
worldPM1025$PM10<-as.numeric(worldPM1025$PM10)
World2.5<-aggregate(x=worldPM1025$PM2.5,by=list(Country=worldPM1025$Country,Region=worldPM1025$Region),FUN = mean)
World10<-aggregate(x=worldPM1025$PM10,by=list(Country=worldPM1025$Country,Region=worldPM1025$Region),FUN = mean)
#Source:https://www.kaggle.com/paultimothymooney/latitude-and-longitude-for-every-country-and-state
countries <- read.csv("~/countries.csv")
names(countries)<-c("code","lat","lon","Country")
mergeddata2.5<-merge(World2.5,countries,type="left")
mergeddata10<-merge(World10,countries,type="left")
# Create a palette that maps factor levels to colors
pal <- colorNumeric(palette = "YlOrRd", domain = mergeddata2.5$x)
pal2 <- colorNumeric(palette = "YlOrRd", domain = mergeddata10$x)
```
```{r, include = FALSE}
o=data.frame(na.omit(AQI[,c(1,5:9)]))
fin<-aggregate(o,by=list(o$City),FUN=mean)
fin<-na.omit(fin[,-2])
names(fin)=c("city","NO","NO2","NOx","NH3","CO")
indlatlon <- read.csv("~/in.csv")
indiaf<-merge(fin,indlatlon,type="inner",by="city")
```
Overview {data-navmenu="Intro"}
=====================================
Column
-------------------------------------
### Overview
The Problem
Climate change is a very big problem in contemporary times, and India, as the second most populous country in the world, contributes significantly to air pollution with many cities in the top 50 most polluted cities in the world. But the extent of pollution is not properly studied and used in decision making in India. Air quality monitoring stations are present in many cities but the data isn’t used for predictions in most cases, but rather as an observation. Another problem is that these AQI monitoring stations aren’t present in small cities. The project aims to build a forecasting model to predict AQI. The project focuses on the city of Indian cities like Delhi, Mumbai and Chennai which are infamous for their pollution levels.
### India NOx pollution
```{r}
indiafplot3<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NOx)
pal5 <- colorNumeric(palette = "YlOrRd", domain = indiafplot3$x)
leaflet(indiafplot3) %>% addTiles() %>%
addCircleMarkers(
radius = ~x,
color = ~pal5(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Column
-------------------------------------
### Pollution level by continents and income (PM 10)
```{r}
library(plotly)
fig <- plot_ly(World10, y = ~x, color = ~Region, type = "box")
fig
```
### Polution levels in different countries (PM 10)
```{r}
leaflet(mergeddata10) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/5,
color = ~pal2(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Data Summary {data-navmenu="Intro"}
=====================================
Column
-------------------------------------
### Meteorological data
Meteorological data
The Meteorological data was scrapped from tutiempo.com. The coding for the scrapper is given with the file.The values collected were Temperature, Purssure, Humidity and Wind Speed.
### AQI Data
AQI Data
The AQI data was downloaded from Kaggle. The data set has AQI, CO, NO, NOx, NH3, SO2, PM2.5, PM10 emissions measured in micro grams per meter cube.
### World pollution Data
World pollution Data
This data was downloaded from WHO Global Ambient Air Quality Database - update 2018.It had locationwise pollution data for countries
Column
-------------------------------------
### India pollution Data
India pollution Data
The AQI data was manipulated and grouped to obtain mean pollution across various cities in India.
### Latitude Longitude countries
Latitude Longitude countries
This Data was downloaded from Kaggle.Since goggle's ggmap is no more free. This was a better option than buying an API.
### Latitude Longitude Indian cities
Latitude Longitude Indian cities
This data had location (geocode) of different Indian cities. This is matched with existing pollution data for visualizations.
World Pollution Levels {data-navmenu="EDA"}
=========================================
Column
-------------------------------------
### Plot Inference
Pollution Levels around the world
The plots on the side reveal how developing countries (Low Income Mediterranean, South East Asia) contribute most to the particulate pollution. Moreover the other plots reveal how India compares to the rest of the world
LMIC: Low Middle Income Countries
HIC: High Income Countries
### PM 2.5 across countries mapped
```{r}
leaflet(mergeddata2.5) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/5,
color = ~pal(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### PM 10 across countries mapped
```{r}
leaflet(mergeddata10) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/5,
color = ~pal2(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Column
-------------------------------------
### PM 10 across continents
```{r}
library(plotly)
fig <- plot_ly(World10, y = ~x, color = ~Region, type = "box")
fig
```
### PM 2.5 across continents
```{r}
library(plotly)
fig1 <- plot_ly(World2.5, y = ~x, color = ~Region, type = "box")
fig1
```
India Pollution Levels{ data-navmenu="EDA"}
=========================================
Column
-------------------------------------
### Plot Inference
Pollution Levels in India
The plots on the side show the amounts of pollutants and their spread across various cities in India. Particulate Matter is not shown here. The pollutants shown here make up significant portion of the Air Quality Index which is a widely used measure. We will be using the AQI as well in this project
### CO
```{r}
indiafplot1<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$CO)
pal3 <- colorNumeric(palette = "YlOrRd", domain = indiafplot1$x)
leaflet(indiafplot1) %>% addTiles() %>%
addCircleMarkers(
radius = ~x*15,
color = ~pal3(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### NO2
```{r}
indiafplot2<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NO2)
pal4 <- colorNumeric(palette = "YlOrRd", domain = indiafplot2$x)
leaflet(indiafplot2) %>% addTiles() %>%
addCircleMarkers(
radius = ~x,
color = ~pal4(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Column
-------------------------------------
### NOx
```{r}
indiafplot3<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NOx)
pal5 <- colorNumeric(palette = "YlOrRd", domain = indiafplot3$x)
leaflet(indiafplot3) %>% addTiles() %>%
addCircleMarkers(
radius = ~x,
color = ~pal5(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### NO
```{r}
indiafplot4<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NO)
pal6 <- colorNumeric(palette = "YlOrRd", domain = indiafplot4$x)
leaflet(indiafplot4) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/2,
color = ~pal6(x),
stroke = FALSE, fillOpacity = 0.5
)
```
### NH3
```{r}
indiafplot5<-data.frame(lat=indiaf$lat,lon=indiaf$lng,x=indiaf$NH3)
pal7 <- colorNumeric(palette = "YlOrRd", domain = indiafplot5$x)
leaflet(indiafplot5) %>% addTiles() %>%
addCircleMarkers(
radius = ~x/2,
color = ~pal7(x),
stroke = FALSE, fillOpacity = 0.5
)
```
Delhi{data-navmenu="Meterological Trends"}
=========================================
Column
-------------------------------------
### Delhi
```{r}
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=77.1025, lat=28.7041, popup="Delhi")
m
```
### Temperature
```{r}
datadelhi <- read.csv("datadelhi.csv")
datadelhi$Date<-as.Date(datadelhi$Date)
x <- datadelhi$Date
y<-datadelhi$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Degrees"))
fig
```
Column
-------------------------------------
### Pressure
```{r}
x <- datadelhi$Date
y<-datadelhi$pressure
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("hPa"))
fig
```
### Humidity
```{r}
x <- datadelhi$Date
y<-datadelhi$humidity[datadelhi$humidity<1000]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("%"))
fig
```
### Wind
```{r}
x <- datadelhi$Date
y<-datadelhi$wind
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Km/h"))
fig
```
Mumbai{data-navmenu="Meterological Trends"}
=========================================
Column
-------------------------------------
### Mumbai
```{r}
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=72.8777, lat=19.0760, popup="Mumbai")
m
```
### Temperature
```{r}
datamumbai <- read.csv("datamumbai.csv")
datamumbai$Date<-as.Date(datamumbai$Date)
x <- datamumbai$Date
y<-datamumbai$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Degrees"))
fig
```
Column
-------------------------------------
### Pressure
```{r}
x <- datamumbai$Date[datamumbai$pressure<1100]
y<-datamumbai$pressure[datamumbai$pressure<1100]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("hPa"))
fig
```
### Humidity
```{r}
x <- datamumbai$Date
y<-datamumbai$humidity
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("%"))
fig
```
### Wind
```{r}
x <- datamumbai$Date
y<-datamumbai$wind
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Km/h"))
fig
```
Chennai{data-navmenu="Meterological Trends"}
=========================================
Column
-------------------------------------
### Chennai
```{r}
m <- leaflet() %>%
addTiles() %>%
addMarkers(lng=80.2707, lat=13.0827, popup="Chennai")
m
```
### Temperature
```{r}
datachennai <- read.csv("datachennai.csv")
datachennai$Date<-as.Date(datachennai$Date)
x <- datachennai$Date
y<-datachennai$temp
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Degrees"))
fig
```
Column
-------------------------------------
### Pressure
```{r}
x <- datachennai$Date
y<-datachennai$pressure
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("hPa"))
fig
```
### Humidity
```{r}
x <- datachennai$Date
y<-datachennai$humidity
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("%"))
fig
```
### Wind
```{r}
x <- datachennai$Date[datachennai$wind<22 ]
y<-datachennai$wind[datachennai$wind<22 ]
fig <- plot_ly(x = ~x, y = ~y, mode = 'lines', text = paste("Km/h"))
fig
```
Methodology {data-navmenu="Time Series Analysis"}
=========================================
Column
-------------------------------------
### ACF/ PACF
Meteorological data
By looking at the autocorrelation function (ACF) and partial autocorrelation (PACF) plots we can tentatively identify the numbers of AR and/or MA terms that are needed.ACF plot is merely a bar chart of the coefficients of correlation between a time series and lags of itself. The PACF plot is a plot of the partial correlation coefficients between the series and lags of itself.
### Model Fitting
AQI Data
We have used the algorithm proposed by Hyndman & Khandakarin 2008 which is implemented through auto.arima in R. This simplifies the process but increases time complexity. Since we have a relatively small data set we have gone ahead with this.
### SARIMA
SARIMA
SARIMA is a Integrated AR, MA model fitted for daily/monthly values as well as an additional seasonal term. We have fitted the model using auto.arima.
Column
-------------------------------------
### SARIMAX
SARIMAX
The SARIMAX model has some regressors as well, to model the variable change with time. This model was fitted using auto.arima but the results were unfavorable due to missing values and compounding of errors.A Distributed Lag model was also checked but gave worse results.
### Ljung Box Test
Ljung Box Test
The Ljung Box test of the residuals show if the residuals are white noise or if they have predictability. The squared residual plot checks conditional heteroscedasticity.
### Model Selection
Model Selections
There are various error measures that can be used for model selection - AIC, BIC, MSE, MAE, MAPE, etc. We have used MAPE. We have chosen the most robust model as that is also an important criteria in selection.
ACF/PACF {data-navmenu="Time Series Analysis"}
=========================================
### Importance {.tabset}
-------------------------------------
### ACF Delhi
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQID.ts))
bacf <- pacf(AQID.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
library(ggplot2)
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity")
q
```
### PACF Delhi
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQID.ts))
bacf <- acf(AQID.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
library(ggplot2)
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity")
q
```
### ACF Mumbai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIM.ts))
bacf <- pacf(AQIM.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
library(ggplot2)
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity")
q
```
### PACF Mumbai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIM.ts))
bacf <- acf(AQIM.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
library(ggplot2)
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity")
q
```
### ACF Chennai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIC.ts))
bacf <- pacf(AQIC.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
library(ggplot2)
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity")
q
```
### PACF Chennai
```{r}
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(AQIC.ts))
bacf <- acf(AQIC.ts, plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
library(ggplot2)
q <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) +
geom_bar(stat = "identity", position = "identity")
q
```
SARIMA {data-navmenu="Time Series Analysis"}
=========================================
Column
-------------------------------------
### Summary
Fitted Models
Delhi: SARIMA(1,1,2)(0,1,0)
Mumbai: SARIMA(1,0,0)(0,1,0)
Chennai: SARIMA(4,1,3)
The Chennai model has a lot of missing values and hence cannot be predicted properly, hence the apparent lack of seasonality. We drop Chennai from our analysis.
### Delhi
```{r}
set.seed(1000)
ARIMAXD.out=auto.arima(y = AQID.ts)
forcastD<-forecast(object = ARIMAXD.out,h = 365)
Time0 <-time(forcastD$mean)
AQIDelhi0<-as.numeric(forcastD$mean)
figD <- plot_ly(x = ~Time0, y = ~AQIDelhi0, mode = 'lines',type = "scatter")
#forcastD
library(plotly)
trace1 <- list(
line = list(
color = "rgba(0,0,0,1)",
fillcolor = "rgba(0,0,0,1)"
),
mode = "lines",
name = "observed",
type = "scatter",
x = time(forcastD$fitted),
y = as.numeric(forcastD$fitted),
xaxis = "x",
yaxis = "y"
)
trace2 <- list(
fill = "toself",
line = list(
color = "rgba(242,242,242,1)",
fillcolor = "rgba(242,242,242,1)"
),
mode = "lines",
name = "95% confidence",
type = "scatter",
x = append(time(forcastD$upper),(append((rev(time(forcastD$upper))[1]),rev(time(forcastD$upper))))),
y = append(as.numeric(forcastD$lower[,2]),append((rev(as.numeric(forcastD$upper[,2]))[1]),rev(as.numeric(forcastD$upper[,2])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace3 <- list(
fill = "toself",
line = list(
color = "rgba(204,204,204,1)",
fillcolor = "rgba(204,204,204,1)"
),
mode = "lines",
name = "80% confidence",
type = "scatter",
x = append(time(forcastD$upper),(append((rev(time(forcastD$upper))[1]),rev(time(forcastD$upper))))),
y = append(as.numeric(forcastD$lower[,1]),append((rev(as.numeric(forcastD$upper[,1]))[1]),rev(as.numeric(forcastD$upper[,1])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace4 <- list(
line = list(
color = "rgba(0,0,255,1)",
fillcolor = "rgba(0,0,255,1)"
),
mode = "lines",
name = "prediction",
type = "scatter",
x = time(forcastD$mean),
y = as.numeric(forcastD$mean),
xaxis = "x",
yaxis = "y"
)
data <- list(trace1, trace2, trace3, trace4)
layout <- list(
title = "Forecast from Auto.Arima",
xaxis = list(
title = "Year",
domain = c(0, 1)
),
yaxis = list(
title = "SAR",
domain = c(0, 1)
),
margin = list(
b = 40,
l = 60,
r = 10,
t = 25
)
)
p <- plot_ly()
p <- add_trace(p, line=trace1$line, mode=trace1$mode, name=trace1$name, type=trace1$type, x=trace1$x, y=trace1$y, xaxis=trace1$xaxis, yaxis=trace1$yaxis)
p <- add_trace(p, fill=trace2$fill, line=trace2$line, mode=trace2$mode, name=trace2$name, type=trace2$type, x=trace2$x, y=trace2$y, xaxis=trace2$xaxis, yaxis=trace2$yaxis, hoveron=trace2$hoveron)
p <- add_trace(p, fill=trace3$fill, line=trace3$line, mode=trace3$mode, name=trace3$name, type=trace3$type, x=trace3$x, y=trace3$y, xaxis=trace3$xaxis, yaxis=trace3$yaxis, hoveron=trace3$hoveron)
p <- add_trace(p, line=trace4$line, mode=trace4$mode, name=trace4$name, type=trace4$type, x=trace4$x, y=trace4$y, xaxis=trace4$xaxis, yaxis=trace4$yaxis)
p <- layout(p, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis, margin=layout$margin)
p
```
Column
-------------------------------------
### Mumbai
```{r}
set.seed(1000)
ARIMAXM.out=auto.arima(y = AQIM.ts)
forcastM<-forecast(object = ARIMAXM.out,h = 365)
Time1<-time(forcastM$mean)
AQIMumbai0<-as.numeric(forcastM$mean)
figM <- plot_ly(x = ~Time1, y = ~AQIMumbai0, mode = 'lines',type = "scatter")
#forcastM
library(plotly)
trace1 <- list(
line = list(
color = "rgba(0,0,0,1)",
fillcolor = "rgba(0,0,0,1)"
),
mode = "lines",
name = "observed",
type = "scatter",
x = time(forcastM$fitted),
y = as.numeric(forcastM$fitted),
xaxis = "x",
yaxis = "y"
)
trace2 <- list(
fill = "toself",
line = list(
color = "rgba(242,242,242,1)",
fillcolor = "rgba(242,242,242,1)"
),
mode = "lines",
name = "95% confidence",
type = "scatter",
x = append(time(forcastM$upper),(append((rev(time(forcastM$upper))[1]),rev(time(forcastM$upper))))),
y = append(as.numeric(forcastM$lower[,2]),append((rev(as.numeric(forcastM$upper[,2]))[1]),rev(as.numeric(forcastM$upper[,2])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace3 <- list(
fill = "toself",
line = list(
color = "rgba(204,204,204,1)",
fillcolor = "rgba(204,204,204,1)"
),
mode = "lines",
name = "80% confidence",
type = "scatter",
x = append(time(forcastM$upper),(append((rev(time(forcastM$upper))[1]),rev(time(forcastM$upper))))),
y = append(as.numeric(forcastM$lower[,1]),append((rev(as.numeric(forcastM$upper[,1]))[1]),rev(as.numeric(forcastM$upper[,1])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace4 <- list(
line = list(
color = "rgba(0,0,255,1)",
fillcolor = "rgba(0,0,255,1)"
),
mode = "lines",
name = "prediction",
type = "scatter",
x = time(forcastM$mean),
y = as.numeric(forcastM$mean),
xaxis = "x",
yaxis = "y"
)
data <- list(trace1, trace2, trace3, trace4)
layout <- list(
title = "Forecast from Auto.Arima",
xaxis = list(
title = "Year",
domain = c(0, 1)
),
yaxis = list(
title = "SAR",
domain = c(0, 1)
),
margin = list(
b = 40,
l = 60,
r = 10,
t = 25
)
)
p <- plot_ly()
p <- add_trace(p, line=trace1$line, mode=trace1$mode, name=trace1$name, type=trace1$type, x=trace1$x, y=trace1$y, xaxis=trace1$xaxis, yaxis=trace1$yaxis)
p <- add_trace(p, fill=trace2$fill, line=trace2$line, mode=trace2$mode, name=trace2$name, type=trace2$type, x=trace2$x, y=trace2$y, xaxis=trace2$xaxis, yaxis=trace2$yaxis, hoveron=trace2$hoveron)
p <- add_trace(p, fill=trace3$fill, line=trace3$line, mode=trace3$mode, name=trace3$name, type=trace3$type, x=trace3$x, y=trace3$y, xaxis=trace3$xaxis, yaxis=trace3$yaxis, hoveron=trace3$hoveron)
p <- add_trace(p, line=trace4$line, mode=trace4$mode, name=trace4$name, type=trace4$type, x=trace4$x, y=trace4$y, xaxis=trace4$xaxis, yaxis=trace4$yaxis)
p <- layout(p, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis, margin=layout$margin)
p
```
### Chennai
```{r}
set.seed(1000)
ARIMAXC.out=auto.arima(y = AQIC.ts)
forcastC<-forecast(object = ARIMAXC.out,h = 365)
x1 <-seq( from =as.Date("2020-01-08"),to = as.Date("2021-01-08"),length.out = 365)
y1<-as.numeric(forcastC$mean)
fig <- plot_ly(x = ~x1, y = ~y1, mode = 'lines',type = "scatter")
#forcastC
library(plotly)
trace1 <- list(
line = list(
color = "rgba(0,0,0,1)",
fillcolor = "rgba(0,0,0,1)"
),
mode = "lines",
name = "observed",
type = "scatter",
x = time(forcastC$fitted),
y = as.numeric(forcastC$fitted),
xaxis = "x",
yaxis = "y"
)
trace2 <- list(
fill = "toself",
line = list(
color = "rgba(242,242,242,1)",
fillcolor = "rgba(242,242,242,1)"
),
mode = "lines",
name = "95% confidence",
type = "scatter",
x = append(time(forcastC$upper),(append((rev(time(forcastC$upper))[1]),rev(time(forcastC$upper))))),
y = append(as.numeric(forcastC$lower[,2]),append((rev(as.numeric(forcastC$upper[,2]))[1]),rev(as.numeric(forcastC$upper[,2])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace3 <- list(
fill = "toself",
line = list(
color = "rgba(204,204,204,1)",
fillcolor = "rgba(204,204,204,1)"
),
mode = "lines",
name = "80% confidence",
type = "scatter",
x = append(time(forcastC$upper),(append((rev(time(forcastC$upper))[1]),rev(time(forcastC$upper))))),
y = append(as.numeric(forcastC$lower[,1]),append((rev(as.numeric(forcastC$upper[,1]))[1]),rev(as.numeric(forcastC$upper[,1])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace4 <- list(
line = list(
color = "rgba(0,0,255,1)",
fillcolor = "rgba(0,0,255,1)"
),
mode = "lines",
name = "prediction",
type = "scatter",
x = time(forcastC$mean),
y = as.numeric(forcastC$mean),
xaxis = "x",
yaxis = "y"
)
data <- list(trace1, trace2, trace3, trace4)
layout <- list(
title = "Forecast from Auto.Arima",
xaxis = list(
title = "Year",
domain = c(0, 1)
),
yaxis = list(
title = "SAR",
domain = c(0, 1)
),
margin = list(
b = 40,
l = 60,
r = 10,
t = 25
)
)
p <- plot_ly()
p <- add_trace(p, line=trace1$line, mode=trace1$mode, name=trace1$name, type=trace1$type, x=trace1$x, y=trace1$y, xaxis=trace1$xaxis, yaxis=trace1$yaxis)
p <- add_trace(p, fill=trace2$fill, line=trace2$line, mode=trace2$mode, name=trace2$name, type=trace2$type, x=trace2$x, y=trace2$y, xaxis=trace2$xaxis, yaxis=trace2$yaxis, hoveron=trace2$hoveron)
p <- add_trace(p, fill=trace3$fill, line=trace3$line, mode=trace3$mode, name=trace3$name, type=trace3$type, x=trace3$x, y=trace3$y, xaxis=trace3$xaxis, yaxis=trace3$yaxis, hoveron=trace3$hoveron)
p <- add_trace(p, line=trace4$line, mode=trace4$mode, name=trace4$name, type=trace4$type, x=trace4$x, y=trace4$y, xaxis=trace4$xaxis, yaxis=trace4$yaxis)
p <- layout(p, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis, margin=layout$margin)
p
```
Model Results SARIMA { data-navmenu="Time Series Analysis"}
===================================================================
### Importance {.tabset}
-------------------------------------
### Delhi
```{r}
print("Delhi")
summary(ARIMAXD.out)
```
### Mumbai
```{r}
print("Mumbai")
summary(ARIMAXM.out)
```
### Chennai
```{r}
print("Chennai")
summary(ARIMAXC.out)
```
SARIMAX {data-navmenu="Time Series Analysis"}
=========================================
Column
-------------------------------------
### Summary
Fitted Models
Delhi: SARIMAX(1,0,0)
Mumbai: SARIMA(2,1,2)
Chennai: SARIMA(1,1,1)
These ARIMAX models are done using all the meteorological variables. This gives models with larger MAPE than the ARIMA model without an X variable/ matrix. This can be due to high deviations in the meteorological variables. A distributed lagged model was also fitted to check. ARIMA with no X variable is the best in any case.
### Delhi
```{r}
datadelhi <- read.csv("datadelhi.csv")
datadelhi$Date<-as.Date(datadelhi$Date)
ydl=as.ts(AQIM.ts[1:1500])
xdl=as.matrix(datadelhi[1:1500,-1])
ARIMAXD2.out=auto.arima(y = ydl, xreg = xdl)
forcastdlD<-forecast(object = ARIMAXD2.out,h = 365,xreg=as.matrix(datadelhi[1501:1865,-1]))
#forcastdlD
library(plotly)
trace1 <- list(
line = list(
color = "rgba(0,0,0,1)",
fillcolor = "rgba(0,0,0,1)"
),
mode = "lines",
name = "observed",
type = "scatter",
x = time(forcastdlD$fitted),
y = as.numeric(forcastdlD$fitted),
xaxis = "x",
yaxis = "y"
)
trace2 <- list(
fill = "toself",
line = list(
color = "rgba(242,242,242,1)",
fillcolor = "rgba(242,242,242,1)"
),
mode = "lines",
name = "95% confidence",
type = "scatter",
x = append(time(forcastdlD$upper),(append((rev(time(forcastdlD$upper))[1]),rev(time(forcastdlD$upper))))),
y = append(as.numeric(forcastdlD$lower[,2]),append((rev(as.numeric(forcastdlD$upper[,2]))[1]),rev(as.numeric(forcastdlD$upper[,2])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace3 <- list(
fill = "toself",
line = list(
color = "rgba(204,204,204,1)",
fillcolor = "rgba(204,204,204,1)"
),
mode = "lines",
name = "80% confidence",
type = "scatter",
x = append(time(forcastdlD$upper),(append((rev(time(forcastdlD$upper))[1]),rev(time(forcastdlD$upper))))),
y = append(as.numeric(forcastdlD$lower[,1]),append((rev(as.numeric(forcastdlD$upper[,1]))[1]),rev(as.numeric(forcastdlD$upper[,1])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace4 <- list(
line = list(
color = "rgba(0,0,255,1)",
fillcolor = "rgba(0,0,255,1)"
),
mode = "lines",
name = "prediction",
type = "scatter",
x = time(forcastdlD$mean),
y = as.numeric(forcastdlD$mean),
xaxis = "x",
yaxis = "y"
)
data <- list(trace1, trace2, trace3, trace4)
layout <- list(
title = "Forecast from Auto.Arima",
xaxis = list(
title = "Year",
domain = c(0, 1)
),
yaxis = list(
title = "SAR",
domain = c(0, 1)
),
margin = list(
b = 40,
l = 60,
r = 10,
t = 25
)
)
p <- plot_ly()
p <- add_trace(p, line=trace1$line, mode=trace1$mode, name=trace1$name, type=trace1$type, x=trace1$x, y=trace1$y, xaxis=trace1$xaxis, yaxis=trace1$yaxis)
p <- add_trace(p, fill=trace2$fill, line=trace2$line, mode=trace2$mode, name=trace2$name, type=trace2$type, x=trace2$x, y=trace2$y, xaxis=trace2$xaxis, yaxis=trace2$yaxis, hoveron=trace2$hoveron)
p <- add_trace(p, fill=trace3$fill, line=trace3$line, mode=trace3$mode, name=trace3$name, type=trace3$type, x=trace3$x, y=trace3$y, xaxis=trace3$xaxis, yaxis=trace3$yaxis, hoveron=trace3$hoveron)
p <- add_trace(p, line=trace4$line, mode=trace4$mode, name=trace4$name, type=trace4$type, x=trace4$x, y=trace4$y, xaxis=trace4$xaxis, yaxis=trace4$yaxis)
p <- layout(p, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis, margin=layout$margin)
p
```
Column
-------------------------------------
### Mumbai
```{r}
datamumbai <- read.csv("datamumbai.csv")
datamumbai$Date<-as.Date(datamumbai$Date)
ydl=as.ts(AQIM.ts[1:1500])
xdl=as.matrix(datamumbai[1:1500,-1])
ARIMAXM2.out=auto.arima(y = ydl, xreg = xdl)
forcastdlM<-forecast(object = ARIMAXM2.out,h = 365,xreg=as.matrix(datamumbai[1501:1865,-1]))
#forcastdlM
library(plotly)
trace1 <- list(
line = list(
color = "rgba(0,0,0,1)",
fillcolor = "rgba(0,0,0,1)"
),
mode = "lines",
name = "observed",
type = "scatter",
x = time(forcastdlM$fitted),
y = as.numeric(forcastdlM$fitted),
xaxis = "x",
yaxis = "y"
)
trace2 <- list(
fill = "toself",
line = list(
color = "rgba(242,242,242,1)",
fillcolor = "rgba(242,242,242,1)"
),
mode = "lines",
name = "95% confidence",
type = "scatter",
x = append(time(forcastdlM$upper),(append((rev(time(forcastdlM$upper))[1]),rev(time(forcastdlM$upper))))),
y = append(as.numeric(forcastdlM$lower[,2]),append((rev(as.numeric(forcastdlM$upper[,2]))[1]),rev(as.numeric(forcastdlM$upper[,2])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace3 <- list(
fill = "toself",
line = list(
color = "rgba(204,204,204,1)",
fillcolor = "rgba(204,204,204,1)"
),
mode = "lines",
name = "80% confidence",
type = "scatter",
x = append(time(forcastdlM$upper),(append((rev(time(forcastdlM$upper))[1]),rev(time(forcastdlM$upper))))),
y = append(as.numeric(forcastdlM$lower[,1]),append((rev(as.numeric(forcastdlM$upper[,1]))[1]),rev(as.numeric(forcastdlM$upper[,1])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace4 <- list(
line = list(
color = "rgba(0,0,255,1)",
fillcolor = "rgba(0,0,255,1)"
),
mode = "lines",
name = "prediction",
type = "scatter",
x = time(forcastdlM$mean),
y = as.numeric(forcastdlM$mean),
xaxis = "x",
yaxis = "y"
)
data <- list(trace1, trace2, trace3, trace4)
layout <- list(
title = "Forecast from Auto.Arima",
xaxis = list(
title = "Year",
domain = c(0, 1)
),
yaxis = list(
title = "SAR",
domain = c(0, 1)
),
margin = list(
b = 40,
l = 60,
r = 10,
t = 25
)
)
p <- plot_ly()
p <- add_trace(p, line=trace1$line, mode=trace1$mode, name=trace1$name, type=trace1$type, x=trace1$x, y=trace1$y, xaxis=trace1$xaxis, yaxis=trace1$yaxis)
p <- add_trace(p, fill=trace2$fill, line=trace2$line, mode=trace2$mode, name=trace2$name, type=trace2$type, x=trace2$x, y=trace2$y, xaxis=trace2$xaxis, yaxis=trace2$yaxis, hoveron=trace2$hoveron)
p <- add_trace(p, fill=trace3$fill, line=trace3$line, mode=trace3$mode, name=trace3$name, type=trace3$type, x=trace3$x, y=trace3$y, xaxis=trace3$xaxis, yaxis=trace3$yaxis, hoveron=trace3$hoveron)
p <- add_trace(p, line=trace4$line, mode=trace4$mode, name=trace4$name, type=trace4$type, x=trace4$x, y=trace4$y, xaxis=trace4$xaxis, yaxis=trace4$yaxis)
p <- layout(p, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis, margin=layout$margin)
p
```
### Chennai
```{r}
datachennai <- read.csv("datachennai.csv")
datachennai$Date<-as.Date(datachennai$Date)
ydl=as.ts(AQIC.ts[1:1500])
xdl=as.matrix(datachennai[1:1500,-1])
ARIMAXC2.out=auto.arima(y = ydl, xreg = xdl)
forcastdlC<-forecast(object = ARIMAXC2.out,h = 365,xreg=as.matrix(datachennai[1501:1865,-1]))
#forcastdlC
library(plotly)
trace1 <- list(
line = list(
color = "rgba(0,0,0,1)",
fillcolor = "rgba(0,0,0,1)"
),
mode = "lines",
name = "observed",
type = "scatter",
x = time(forcastdlC$fitted),
y = as.numeric(forcastdlC$fitted),
xaxis = "x",
yaxis = "y"
)
trace2 <- list(
fill = "toself",
line = list(
color = "rgba(242,242,242,1)",
fillcolor = "rgba(242,242,242,1)"
),
mode = "lines",
name = "95% confidence",
type = "scatter",
x = append(time(forcastdlC$upper),(append((rev(time(forcastdlC$upper))[1]),rev(time(forcastdlC$upper))))),
y = append(as.numeric(forcastdlC$lower[,2]),append((rev(as.numeric(forcastdlC$upper[,2]))[1]),rev(as.numeric(forcastdlC$upper[,2])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace3 <- list(
fill = "toself",
line = list(
color = "rgba(204,204,204,1)",
fillcolor = "rgba(204,204,204,1)"
),
mode = "lines",
name = "80% confidence",
type = "scatter",
x = append(time(forcastdlC$upper),(append((rev(time(forcastdlC$upper))[1]),rev(time(forcastdlC$upper))))),
y = append(as.numeric(forcastdlC$lower[,1]),append((rev(as.numeric(forcastdlC$upper[,1]))[1]),rev(as.numeric(forcastdlC$upper[,1])))),
xaxis = "x",
yaxis = "y",
hoveron = "points"
)
trace4 <- list(
line = list(
color = "rgba(0,0,255,1)",
fillcolor = "rgba(0,0,255,1)"
),
mode = "lines",
name = "prediction",
type = "scatter",
x = time(forcastdlC$mean),
y = as.numeric(forcastdlC$mean),
xaxis = "x",
yaxis = "y"
)
data <- list(trace1, trace2, trace3, trace4)
layout <- list(
title = "Forecast from Auto.Arima",
xaxis = list(
title = "Year",
domain = c(0, 1)
),
yaxis = list(
title = "SAR",
domain = c(0, 1)
),
margin = list(
b = 40,
l = 60,
r = 10,
t = 25
)
)
p <- plot_ly()
p <- add_trace(p, line=trace1$line, mode=trace1$mode, name=trace1$name, type=trace1$type, x=trace1$x, y=trace1$y, xaxis=trace1$xaxis, yaxis=trace1$yaxis)
p <- add_trace(p, fill=trace2$fill, line=trace2$line, mode=trace2$mode, name=trace2$name, type=trace2$type, x=trace2$x, y=trace2$y, xaxis=trace2$xaxis, yaxis=trace2$yaxis, hoveron=trace2$hoveron)
p <- add_trace(p, fill=trace3$fill, line=trace3$line, mode=trace3$mode, name=trace3$name, type=trace3$type, x=trace3$x, y=trace3$y, xaxis=trace3$xaxis, yaxis=trace3$yaxis, hoveron=trace3$hoveron)
p <- add_trace(p, line=trace4$line, mode=trace4$mode, name=trace4$name, type=trace4$type, x=trace4$x, y=trace4$y, xaxis=trace4$xaxis, yaxis=trace4$yaxis)
p <- layout(p, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis, margin=layout$margin)
p
```
Model Results SARIMAX { data-navmenu="Time Series Analysis"}
===================================================================
### Importance {.tabset}
-------------------------------------
### Delhi
```{r}
print("Delhi")
summary(ARIMAXD2.out)
```
### Mumbai
```{r}
print("Mumbai")
summary(ARIMAXM2.out)
```
### Chennai
```{r}
print("Chennai")
summary(ARIMAXC2.out)
```
Ljung - Box Residual Test { data-navmenu="Time Series Analysis"}
===================================================================
### Importance {.tabset}
-------------------------------------
### Delhi
```{r}
Box.test(ARIMAXD.out$residuals,lag=10,type="Ljung")
Box.test(x = ARIMAXD.out$residuals^2,lag=1,type="Ljung")
```
### Mumbai
```{r}
Box.test(ARIMAXM.out$residuals,lag=10,type="Ljung")
Box.test(x = ARIMAXM.out$residuals^2,lag=1,type="Ljung")
```
### Chennai
```{r}
Box.test(ARIMAXC.out$residuals,lag=10,type="Ljung")
Box.test(x = ARIMAXC.out$residuals^2,lag=1,type="Ljung")
```
Comparison & Results {data-navmenu="Time Series Analysis"}
===================================================================
Column
-------------------------------------
### Conclusion
Conclusion
In this project we have identified the SARIMA function using Hyndman & Khandakar algorithm in auto.arima to be the best model for our problem. Due to the limitations of computing power and relatively low quality of data, we were only able to forecast for Delhi and Mumbai. With distributed computing, this can be overcome. The model can be automated and scaled with better quality data to make an interactive dashboard for policy makers to make decisions and to track climate change and fluctuations. Technologies like HDFS, Map Reduce with R can be on a cloud service provider used for scaling.
Thank You
### Delhi Final Prediction
```{r}
figD
```
Column
-------------------------------------
### Limitations
Limitations
The Ljung Box test confirmed conditional heteroscedasticity. GARCH models were fitted but the standard deviations were too high. The very nature of this data is highly volatile. Therefore we have obtained a reasonable prediction. Usage of a fourier curve for the seasonal component or modern techniques like LSTM Neural Nets might give better results.
### Mumbai Final Prediction
```{r}
figM
```
Web Scrapper
=========================================
### Importance {.tabset}
-------------------------------------
### Page 1
```{r, eval=FALSE, echo=TRUE}
# This code was used to scrape meterological data from the web
data<-NULL
for ( year in 2019:2020)
{
for(month in 1:12)
{
monthname<-c('january','february','march','april','may','june','july','august','september','october','november','december')
if (monthname[month] %in% c('january','march','may','july','august','october','december')){j=31}
else {j=30}
if(month==2&year%%4==0){j=29}
else if (month==2&year%%4!=0){j=28}
for(day in 1:j)
{
url<-read_html(paste0('https://en.tutiempo.net/records/vidp/',day,'-',monthname[month],'-',year,'.html'))
temp.raw<-url%>%
html_nodes('.Temp')%>%
html_text()
temp.raw<-gsub('[^\x20-\x7E]', '', temp.raw)
humid.raw<-url%>%
html_nodes('.hr')%>%
html_text()
humid.raw<-str_remove(humid.raw,'%')
wind.raw<-url%>%
html_nodes('.wind')%>%
html_text()
```
### Page 2
```{r , eval=FALSE, echo=TRUE}
wind.raw<-str_remove(wind.raw,' km/h')
wind.raw<-na.omit(str_extract(wind.raw, '[[:digit:]]+'))
pressure.raw<-url%>%
html_nodes('.prob')%>%
html_text()
pressure.raw<-str_remove(pressure.raw,'hPa')
humidity<-mean(na.omit(as.integer(humid.raw)))
temp<-mean(na.omit(as.integer(temp.raw)))
pressure<-mean(na.omit(as.integer(pressure.raw)))
wind<-mean(na.omit(as.integer(wind.raw)))
Date<-as.Date(paste0(year,'-',month,'-',day))
datatemp<-data.frame(Date,temp,pressure,humidity,wind)
data<-rbind(data,datatemp)
}
}
nametemp<-paste('data',year,sep='_')
assign(nametemp,data)
}
data2<-rbind(data_2015,data_2016,data_2017,data_2018,data_2019,data_2020)
write.csv(data2,'data2.csv', row.names = FALSE)
```
Credits
=========================================
Column
-------------------------------------
### Group No
Group No
28
### Member
Wellington Daniel
2016IPM118
### Member
Oshin Singhal
2016IPM070
Column
-------------------------------------
### Section
Section
A
### Member
Shashank Giri
2016IPM094
### Member
Aditya Badonia
2016IPM004